Finding Low Error Clusterings

Authors

  • Maria-Florina Balcan
  • Mark Braverman
Abstract

A common approach for solving clustering problems is to design algorithms to approximately optimize various objective functions (e.g., k-means or min-sum) defined in terms of some given pairwise distance or similarity information. However, in many learning-motivated clustering applications (such as clustering proteins by function) there is some unknown target clustering; in such cases the pairwise information is merely based on heuristics, and the real goal is to achieve low error on the data. In these settings, an arbitrary c-approximation algorithm for some objective works well only if any c-approximation to that objective is close to the target clustering. In recent work, Balcan et al. [7] have shown that for both the k-means and k-median objectives this property allows one to produce clusterings of low error, even for values c such that obtaining a c-approximation to these objective functions is provably NP-hard. In this paper we analyze the min-sum objective from this perspective. While [7] also considered the min-sum problem, the results derived for this objective were substantially weaker. In this work we derive new and more subtle structural properties for min-sum in this context and use them to design efficient algorithms for producing accurate clusterings, both in the transductive and in the inductive case. We also analyze the correlation clustering problem from this perspective, and point out interesting differences between this objective and the k-median, k-means, and min-sum objectives.
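For readers unfamiliar with the objectives discussed in the abstract, the following sketch (an illustration of the standard definitions, not code from the paper) computes the min-sum and k-median costs of a fixed clustering from a pairwise distance matrix. The function names and the use of medoids as k-median centers are my own choices.

```python
import numpy as np

def min_sum_objective(dist, labels):
    """Min-sum cost: sum of all intra-cluster pairwise distances.

    dist   -- (n, n) symmetric distance matrix with zero diagonal
    labels -- length-n array of cluster assignments
    """
    total = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        sub = dist[np.ix_(idx, idx)]
        # Each unordered pair inside the cluster is counted once;
        # some papers count ordered pairs, i.e. twice this value.
        total += sub.sum() / 2.0
    return total

def k_median_objective(dist, labels):
    """k-median cost with medoid centers: for each cluster, pick the point
    minimizing the sum of distances to the rest, then sum over clusters."""
    total = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        sub = dist[np.ix_(idx, idx)]
        total += sub.sum(axis=1).min()
    return total
```

In the framework studied here, the assumption is that any clustering whose cost under such an objective is within a factor c of optimal is also close (in error) to the unknown target clustering.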


Similar Resources

Agnostic Clustering

Motivated by the principle of agnostic learning, we present an extension of the model introduced by Balcan, Blum, and Gupta [3] on computing low-error clusterings. The extended model uses a weaker assumption on the target clustering, which captures data clustering in the presence of outliers or ill-behaved data points. Unlike the original target clustering property, with our new property it may no ...


Distributed and Provably Good Seedings for k-Means in Constant Rounds

The k-means++ algorithm is the state-of-the-art algorithm for solving k-means clustering problems, as the computed clusterings are O(log k)-competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set, making it hard to scale to massive data sets. The standard remedy is to use the k-means‖ algorithm, which reduces the number of sequential rou...
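For context on the seeding step described above, k-means++ selects centers by D² sampling: the first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to the squared distance to the nearest center chosen so far. A minimal sketch follows (illustrative names, not taken from the cited work).

```python
import numpy as np

def kmeanspp_seeding(X, k, seed=None):
    """D^2 seeding as in k-means++: k sequential passes over the data,
    each new center drawn with probability proportional to the squared
    distance to the closest center chosen so far."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                      # first center: uniform
    d2 = np.sum((X - centers[0]) ** 2, axis=1)          # squared distance to nearest center
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))
    return np.array(centers)
```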


Iterative Optimization and Simplification of Hierarchical Clusterings

Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can ...


On Finding Graph Clusterings with Maximum Modularity

Modularity is a recently introduced quality measure for graph clusterings. It has immediately received considerable attention in several disciplines, and in particular in the complex systems literature, although its properties are not well understood. We study the problem of finding clusterings with maximum modularity, thus providing theoretical foundations for past and present work based on th...
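As a point of reference, the sketch below computes Newman-Girvan modularity for a given partition, Q = Σ_c (e_c − a_c²), where e_c is the fraction of edges inside community c and a_c is the fraction of edge endpoints in c. This illustrates the standard definition only, not the maximization algorithms studied in the cited work.

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity of a partition of an undirected graph.

    A      -- (n, n) symmetric adjacency matrix (0/1 or weighted)
    labels -- length-n array of community assignments
    """
    m2 = A.sum()                 # 2m: every edge's weight counted twice
    degrees = A.sum(axis=1)
    q = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        e_c = A[np.ix_(idx, idx)].sum() / m2      # edge fraction inside c
        a_c = degrees[idx].sum() / m2             # fraction of endpoints in c
        q += e_c - a_c ** 2
    return q
```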




Journal title:

Volume   Issue

Pages  -

Publication date: 2009